Multiple Multistep Time Series Analysis

By Chuanyun (Clara) Zang

The analysis tackles the Kaggle competition Web Traffic Time Series Forecasting (https://www.kaggle.com/c/web-traffic-time-series-forecasting): given daily view counts for roughly 145,000 Wikipedia pages, forecast future daily views for each page.

A Long Short-Term Memory (LSTM) network and some basic statistical methods are used in the analysis.

Set up libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.callbacks import ModelCheckpoint  
Using TensorFlow backend.

Import Datasets

In [92]:
data = pd.read_csv('train_1.csv')
data.head()
Out[92]:
Page 2015-07-01 2015-07-02 2015-07-03 2015-07-04 2015-07-05 2015-07-06 2015-07-07 2015-07-08 2015-07-09 ... 2016-12-22 2016-12-23 2016-12-24 2016-12-25 2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
0 2NE1_zh.wikipedia.org_all-access_spider 18.0 11.0 5.0 13.0 14.0 9.0 9.0 22.0 26.0 ... 32.0 63.0 15.0 26.0 14.0 20.0 22.0 19.0 18.0 20.0
1 2PM_zh.wikipedia.org_all-access_spider 11.0 14.0 15.0 18.0 11.0 13.0 22.0 11.0 10.0 ... 17.0 42.0 28.0 15.0 9.0 30.0 52.0 45.0 26.0 20.0
2 3C_zh.wikipedia.org_all-access_spider 1.0 0.0 1.0 1.0 0.0 4.0 0.0 3.0 4.0 ... 3.0 1.0 1.0 7.0 4.0 4.0 6.0 3.0 4.0 17.0
3 4minute_zh.wikipedia.org_all-access_spider 35.0 13.0 10.0 94.0 4.0 26.0 14.0 9.0 11.0 ... 32.0 10.0 26.0 27.0 16.0 11.0 17.0 19.0 10.0 11.0
4 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 48.0 9.0 25.0 13.0 3.0 11.0 27.0 13.0 36.0 10.0

5 rows × 551 columns

In [30]:
data.info()
data.tail()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145063 entries, 0 to 145062
Columns: 551 entries, Page to 2016-12-31
dtypes: float64(550), object(1)
memory usage: 609.8+ MB
Out[30]:
Page 2015-07-01 2015-07-02 2015-07-03 2015-07-04 2015-07-05 2015-07-06 2015-07-07 2015-07-08 2015-07-09 ... 2016-12-22 2016-12-23 2016-12-24 2016-12-25 2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
145058 Underworld_(serie_de_películas)_es.wikipedia.o... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 13.0 12.0 13.0 3.0 5.0 10.0
145059 Resident_Evil:_Capítulo_Final_es.wikipedia.org... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145060 Enamorándome_de_Ramón_es.wikipedia.org_all-acc... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145061 Hasta_el_último_hombre_es.wikipedia.org_all-ac... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
145062 Francisco_el_matemático_(serie_de_televisión_d... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 551 columns

In [104]:
key = pd.read_csv('key_1.csv')
key_split = key.copy()
key_split['date'] = key_split.Page.apply(lambda a: a[-10:])
key_split['Page'] = key_split.Page.apply(lambda a: a[:-11])
key_split.head()
Out[104]:
Page Id date
0 !vote_en.wikipedia.org_all-access_all-agents bf4edcf969af 2017-01-01
1 !vote_en.wikipedia.org_all-access_all-agents 929ed2bf52b9 2017-01-02
2 !vote_en.wikipedia.org_all-access_all-agents ff29d0f51d5c 2017-01-03
3 !vote_en.wikipedia.org_all-access_all-agents e98873359be6 2017-01-04
4 !vote_en.wikipedia.org_all-access_all-agents fa012434263a 2017-01-05
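Each Page string packs several fields (article title, project, access, agent) into one underscore-separated value. Since the article title itself may contain underscores, it is safest to split from the right. A minimal sketch (the helper `split_page` and the field names are my own, not part of the notebook):

```python
# Each Page string ends with "<project>_<access>_<agent>"; the article
# title itself may contain underscores, so split from the right.
def split_page(page):
    article, project, access, agent = page.rsplit('_', 3)
    return article, project, access, agent

print(split_page('2NE1_zh.wikipedia.org_all-access_spider'))
# ('2NE1', 'zh.wikipedia.org', 'all-access', 'spider')
```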

Define metric functions

In [105]:
from keras import backend as K

def smape(y_true, y_pred):
    diff = K.abs((y_true - y_pred) / K.clip(K.abs(y_true)+K.abs(y_pred),
                                            K.epsilon(),
                                            None))
    return 200. * K.mean(diff, axis=-1)

def smape_test(y_true, y_pred):
    denominator = (y_true+y_pred)/200.0
    diff = np.abs(y_true-y_pred)/denominator
    diff[denominator==0] = 0.0
    return np.mean(diff)
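As a quick sanity check of `smape_test`, the symmetric MAPE of a single off-by-two prediction can be worked out by hand (a standalone NumPy sketch, not part of the original notebook):

```python
import numpy as np

y_true = np.array([10.0, 0.0, 5.0])
y_pred = np.array([12.0, 0.0, 5.0])

denominator = (y_true + y_pred) / 200.0
with np.errstate(invalid='ignore'):           # 0/0 handled below
    diff = np.abs(y_true - y_pred) / denominator
diff[denominator == 0] = 0.0                  # convention: both zero -> 0

# Only the first entry contributes: 200 * 2 / 22, so the mean over
# the three entries is roughly 6.06.
print(np.mean(diff))
```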

Exploratory Analysis

In [48]:
data0 = data.copy()
#data0.describe(include='all')
In [52]:
data0.iloc[:,1:].isnull().sum(axis=0).describe()
#data0.isnull().sum(axis=1).describe()
Out[52]:
count      550.000000
mean     11259.874545
std       5275.772841
min       3189.000000
25%       6614.500000
50%      10560.500000
75%      15792.500000
max      20816.000000
dtype: float64
In [54]:
((data0.isnull().sum(axis=1) == 0)).sum()
Out[54]:
117277
In [37]:
plt.hist(data0.isnull().sum(axis=1))
plt.xlabel('missing days')
plt.ylabel('number of Pages')
plt.title('Distribution of Pages along Missing Values')
Out[37]:
<matplotlib.text.Text at 0x120e7f990>
In [71]:
data0 = data0.fillna(0)
In [64]:
days = range(550)
def plot_entry(idx):
    data_temp = data0.iloc[idx,1:]
    fig = plt.figure(1,figsize=(10,5))
    plt.plot(days,data_temp)
    plt.xlabel('day')
    plt.ylabel('views')
    plt.title(data0.iloc[idx,0])
    
    plt.show()
In [127]:
for idx in [0, 5, 50, 10000, 50000]:
    print(idx)
    print(data0.iloc[idx,0])
    plot_entry(idx)    
0
2NE1_zh.wikipedia.org_all-access_spider
5
5566_zh.wikipedia.org_all-access_spider
50
Fate/Zero_zh.wikipedia.org_all-access_spider
10000
Ohi_Day_en.wikipedia.org_desktop_all-agents
50000
Suicide_Squad_(Film)_de.wikipedia.org_all-access_spider
In [73]:
plot_entry(4)

Preprocess Data

Save the true values for checking model performance later

In [93]:
data = data.fillna(0)
y_train_true = data[list(data.columns[-120:-60])]
y_validate_true = data[list(data.columns[-60:])]

Work on a copy of the training data

In [94]:
data1 = data.copy()

data1['index1'] = data1.index
data1 = data1[['index1']+list(data1.columns)[1:-1]]
data1.head()
Out[94]:
index1 2015-07-01 2015-07-02 2015-07-03 2015-07-04 2015-07-05 2015-07-06 2015-07-07 2015-07-08 2015-07-09 ... 2016-12-22 2016-12-23 2016-12-24 2016-12-25 2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
0 0 18.0 11.0 5.0 13.0 14.0 9.0 9.0 22.0 26.0 ... 32.0 63.0 15.0 26.0 14.0 20.0 22.0 19.0 18.0 20.0
1 1 11.0 14.0 15.0 18.0 11.0 13.0 22.0 11.0 10.0 ... 17.0 42.0 28.0 15.0 9.0 30.0 52.0 45.0 26.0 20.0
2 2 1.0 0.0 1.0 1.0 0.0 4.0 0.0 3.0 4.0 ... 3.0 1.0 1.0 7.0 4.0 4.0 6.0 3.0 4.0 17.0
3 3 35.0 13.0 10.0 94.0 4.0 26.0 14.0 9.0 11.0 ... 32.0 10.0 26.0 27.0 16.0 11.0 17.0 19.0 10.0 11.0
4 4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 48.0 9.0 25.0 13.0 3.0 11.0 27.0 13.0 36.0 10.0

5 rows × 551 columns

Transform all values with log(x + 1) to compress the heavy-tailed view counts

In [95]:
### Take log(x+1)
data1[data1.columns[1:]] = data1[data1.columns[1:]].apply(lambda x: np.log(x+1))
data1.head()
Out[95]:
index1 2015-07-01 2015-07-02 2015-07-03 2015-07-04 2015-07-05 2015-07-06 2015-07-07 2015-07-08 2015-07-09 ... 2016-12-22 2016-12-23 2016-12-24 2016-12-25 2016-12-26 2016-12-27 2016-12-28 2016-12-29 2016-12-30 2016-12-31
0 0 2.944439 2.484907 1.791759 2.639057 2.708050 2.302585 2.302585 3.135494 3.295837 ... 3.496508 4.158883 2.772589 3.295837 2.708050 3.044522 3.135494 2.995732 2.944439 3.044522
1 1 2.484907 2.708050 2.772589 2.944439 2.484907 2.639057 3.135494 2.484907 2.397895 ... 2.890372 3.761200 3.367296 2.772589 2.302585 3.433987 3.970292 3.828641 3.295837 3.044522
2 2 0.693147 0.000000 0.693147 0.693147 0.000000 1.609438 0.000000 1.386294 1.609438 ... 1.386294 0.693147 0.693147 2.079442 1.609438 1.609438 1.945910 1.386294 1.609438 2.890372
3 3 3.583519 2.639057 2.397895 4.553877 1.609438 3.295837 2.708050 2.302585 2.484907 ... 3.496508 2.397895 3.295837 3.332205 2.833213 2.484907 2.890372 2.995732 2.397895 2.484907
4 4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 3.891820 2.302585 3.258097 2.639057 1.386294 2.484907 3.332205 2.639057 3.610918 2.397895

5 rows × 551 columns
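The transform above is equivalent to NumPy's `log1p`, whose exact inverse `expm1` can later map predictions back to the view-count scale. A small round-trip sketch (illustrative values only):

```python
import numpy as np

views = np.array([0.0, 18.0, 11.0, 5.0])
logged = np.log1p(views)      # identical to np.log(views + 1)
restored = np.expm1(logged)   # exact inverse, for mapping predictions back

print(np.allclose(restored, views))  # True
```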

Define X_train, y_train, X_validate, y_validate, and X_test

In [96]:
X_train = data1[data1.columns[1:-120]]
y_train = data1[list(data1.columns[-120:-60])]
X_validate = data1[list(data1.columns[61:-60])]
y_validate = data1[list(data1.columns[-60:])]
X_test = data1[list(data1.columns[121:])]

print('X_train: ', X_train.shape)
print(X_train.head(1))
print('y_train: ', y_train.shape)
print(y_train.head(1))
print('X_validate: ', X_validate.shape)
print(X_validate.head(1))
print('y_validate: ', y_validate.shape)
print(y_validate.head(1))
print('X_test: ', X_test.shape)
print(X_test.head(1)) 
('X_train: ', (145063, 430))
   2015-07-01  2015-07-02  2015-07-03  2015-07-04  2015-07-05  2015-07-06  \
0    2.944439    2.484907    1.791759    2.639057     2.70805    2.302585   

   2015-07-07  2015-07-08  2015-07-09  2015-07-10     ...      2016-08-24  \
0    2.302585    3.135494    3.295837    3.218876     ...        3.044522   

   2016-08-25  2016-08-26  2016-08-27  2016-08-28  2016-08-29  2016-08-30  \
0     2.70805    3.713572    2.772589    2.944439    3.295837    2.197225   

   2016-08-31  2016-09-01  2016-09-02  
0    3.258097    3.091042    3.044522  

[1 rows x 430 columns]
('y_train: ', (145063, 60))
   2016-09-03  2016-09-04  2016-09-05  2016-09-06  2016-09-07  2016-09-08  \
0    3.258097    2.995732    3.178054    2.944439    2.995732    2.944439   

   2016-09-09  2016-09-10  2016-09-11  2016-09-12     ...      2016-10-23  \
0    4.025352    2.833213    4.189655    2.484907     ...        3.295837   

   2016-10-24  2016-10-25  2016-10-26  2016-10-27  2016-10-28  2016-10-29  \
0    3.258097    2.833213    2.995732    3.044522    2.564949    2.995732   

   2016-10-30  2016-10-31  2016-11-01  
0    3.931826    2.833213    3.433987  

[1 rows x 60 columns]
('X_validate: ', (145063, 430))
   2015-08-30  2015-08-31  2015-09-01  2015-09-02  2015-09-03  2015-09-04  \
0    2.302585    2.397895    2.302585    2.484907    2.484907    2.484907   

   2015-09-05  2015-09-06  2015-09-07  2015-09-08     ...      2016-10-23  \
0    2.302585    2.772589    1.791759    2.397895     ...        3.295837   

   2016-10-24  2016-10-25  2016-10-26  2016-10-27  2016-10-28  2016-10-29  \
0    3.258097    2.833213    2.995732    3.044522    2.564949    2.995732   

   2016-10-30  2016-10-31  2016-11-01  
0    3.931826    2.833213    3.433987  

[1 rows x 430 columns]
('y_validate: ', (145063, 60))
   2016-11-02  2016-11-03  2016-11-04  2016-11-05  2016-11-06  2016-11-07  \
0    2.944439    3.258097     2.70805    3.044522    2.197225    4.219508   

   2016-11-08  2016-11-09  2016-11-10  2016-11-11     ...      2016-12-22  \
0    2.639057     3.73767    2.397895    3.091042     ...        3.496508   

   2016-12-23  2016-12-24  2016-12-25  2016-12-26  2016-12-27  2016-12-28  \
0    4.158883    2.772589    3.295837     2.70805    3.044522    3.135494   

   2016-12-29  2016-12-30  2016-12-31  
0    2.995732    2.944439    3.044522  

[1 rows x 60 columns]
('X_test: ', (145063, 430))
   2015-10-29  2015-10-30  2015-10-31  2015-11-01  2015-11-02  2015-11-03  \
0    2.079442    2.302585    2.397895    3.218876     1.94591     1.94591   

   2015-11-04  2015-11-05  2015-11-06  2015-11-07     ...      2016-12-22  \
0    2.197225    2.833213    2.639057    2.397895     ...        3.496508   

   2016-12-23  2016-12-24  2016-12-25  2016-12-26  2016-12-27  2016-12-28  \
0    4.158883    2.772589    3.295837     2.70805    3.044522    3.135494   

   2016-12-29  2016-12-30  2016-12-31  
0    2.995732    2.944439    3.044522  

[1 rows x 430 columns]
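The slicing above builds three overlapping 430-day input windows, each followed (for train and validation) by a 60-day target window, shifted forward by 60 days each time. The index arithmetic can be sketched on day numbers alone (the variable names here are my own):

```python
import numpy as np

n_days, window, horizon = 550, 430, 60
days = np.arange(n_days)  # stand-ins for the 550 date columns

X_train_idx    = days[:n_days - 2 * horizon]     # days 0..429
y_train_idx    = days[-2 * horizon:-horizon]     # days 430..489
X_validate_idx = days[horizon:n_days - horizon]  # days 60..489
y_validate_idx = days[-horizon:]                 # days 490..549
X_test_idx     = days[2 * horizon:]              # days 120..549

print(len(X_train_idx), len(X_validate_idx), len(X_test_idx))  # 430 430 430
```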

Normalize X_train/validate/test

In [97]:
### Min-max normalize X column-wise so that views on each day fall in a common [0, 1] range
X_train_norm = (X_train - X_train.min())/(X_train.max()-X_train.min())
X_validate_norm = (X_validate - X_validate.min())/(X_validate.max()-X_validate.min())

X_test_norm = (X_test - X_test.min())/(X_test.max()-X_test.min())

print('X_train_norm: ', X_train_norm.shape, X_train_norm.head(1))
print('y_validate: ', y_validate.shape, y_validate.head(1))
('X_train_norm: ', (145063, 430),    2015-07-01  2015-07-02  2015-07-03  2015-07-04  2015-07-05  2015-07-06  \
0    0.174951    0.147488    0.106718    0.156779    0.160724    0.135998   

   2015-07-07  2015-07-08  2015-07-09  2015-07-10     ...      2016-08-24  \
0     0.13649    0.187019    0.196053    0.191357     ...        0.179108   

   2016-08-25  2016-08-26  2016-08-27  2016-08-28  2016-08-29  2016-08-30  \
0    0.160072     0.22039    0.162604    0.172248    0.191679    0.127354   

   2016-08-31  2016-09-01  2016-09-02  
0    0.188809    0.178959    0.177051  

[1 rows x 430 columns])
('y_validate: ', (145063, 60),    2016-11-02  2016-11-03  2016-11-04  2016-11-05  2016-11-06  2016-11-07  \
0    2.944439    3.258097     2.70805    3.044522    2.197225    4.219508   

   2016-11-08  2016-11-09  2016-11-10  2016-11-11     ...      2016-12-22  \
0    2.639057     3.73767    2.397895    3.091042     ...        3.496508   

   2016-12-23  2016-12-24  2016-12-25  2016-12-26  2016-12-27  2016-12-28  \
0    4.158883    2.772589    3.295837     2.70805    3.044522    3.135494   

   2016-12-29  2016-12-30  2016-12-31  
0    2.995732    2.944439    3.044522  

[1 rows x 60 columns])
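Note that the min-max scaling above is column-wise: each day's views are rescaled to [0, 1] across all pages (a day whose views were constant across every page would give a zero range and produce NaNs). A toy illustration, not from the notebook:

```python
import pandas as pd

# Toy frame: rows are pages, columns are days.
df = pd.DataFrame({'d1': [0.0, 2.0, 4.0], 'd2': [1.0, 3.0, 5.0]})

# Column-wise min-max, as in the cell above: each day's views are
# rescaled to [0, 1] across all pages.
norm = (df - df.min()) / (df.max() - df.min())
print(norm['d1'].tolist())  # [0.0, 0.5, 1.0]
```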

Reshape datasets for LSTM model

In [98]:
## reshape after normalized
X_train1 = X_train_norm.values.reshape(X_train.shape[0], 1, X_train.shape[1])
X_validate1 = X_validate_norm.values.reshape(X_validate.shape[0], 1, X_validate.shape[1])
y_train1 = y_train.values
y_validate1 = y_validate.values

X_test1 = X_test_norm.values.reshape(X_test.shape[0], 1, X_test.shape[1])

print('X_train: ', X_train1.shape)
print(X_train1[0])
print('y_train: ', y_train1.shape)
print(y_train1[0])
print('X_validate: ', X_validate1.shape)
print(X_validate1[0])
print('y_validate: ', y_validate1.shape)
print(y_validate1[0])
print('X_test: ', X_test1.shape)
print(X_test1[0]) 
('X_train: ', (145063, 1, 430))
[[ 0.1749505   0.14748829  0.10671771 ...,  0.18880855  0.17895949  0.17705119]]
('y_train: ', (145063, 60))
[ 3.25809654  2.99573227  3.17805383  2.94443898  2.99573227  2.94443898
  4.02535169  2.83321334  4.18965474  2.48490665  2.48490665  2.63905733
  3.04452244  3.09104245  2.63905733  3.21887582  3.04452244  2.63905733
  3.49650756  2.83321334  2.39789527  2.63905733  3.80666249  2.89037176
  2.63905733  4.29045944  3.71357207  2.99573227  2.7080502   2.63905733
  2.56494936  2.7080502   2.39789527  3.29583687  2.63905733  3.13549422
  2.7080502   3.17805383  2.56494936  2.19722458  3.93182563  2.63905733
  2.39789527  2.83321334  2.7080502   2.39789527  3.21887582  2.39789527
  3.04452244  2.39789527  3.29583687  3.25809654  2.83321334  2.99573227
  3.04452244  2.56494936  2.99573227  3.93182563  2.83321334  3.4339872 ]
('X_validate: ', (145063, 1, 430))
[[ 0.13774117  0.14240646  0.13779412 ...,  0.23195875  0.16623496  0.20205947]]
('y_validate: ', (145063, 60))
[ 2.94443898  3.25809654  2.7080502   3.04452244  2.19722458  4.21950771
  2.63905733  3.73766962  2.39789527  3.09104245  2.63905733  2.19722458
  2.77258872  2.7080502   2.56494936  1.94591015  2.48490665  2.39789527
  3.76120012  3.09104245  3.21887582  2.7080502   2.48490665  5.32300998
  2.7080502   3.8286414   3.52636052  3.36729583  2.94443898  2.7080502
  3.87120101  2.77258872  2.7080502   2.94443898  3.04452244  2.7080502
  2.83321334  2.7080502   3.04452244  4.11087386  3.13549422  2.77258872
  2.89037176  2.99573227  2.94443898  3.09104245  3.09104245  3.87120101
  4.18965474  2.89037176  3.49650756  4.15888308  2.77258872  3.29583687
  2.7080502   3.04452244  3.13549422  2.99573227  2.94443898  3.04452244]
('X_test: ', (145063, 1, 430))
[[ 0.12414341  0.13774897  0.14347865 ...,  0.1754307   0.17311456  0.17825759]]

Build Models

Define the model, train it on (X_train, y_train), and validate on (X_validate, y_validate)

In [100]:
LSTM_model = Sequential()
LSTM_model.add(LSTM(256, input_shape = (1,X_train.shape[1])))
LSTM_model.add(Dropout(0.3))
LSTM_model.add(Dense(60))
LSTM_model.compile(loss = 'mean_absolute_error', optimizer = 'rmsprop')
#LSTM_model.compile(loss = smape, optimizer = 'rmsprop')

LSTM_model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_2 (LSTM)                (None, 256)               703488    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 60)                15420     
=================================================================
Total params: 718,908
Trainable params: 718,908
Non-trainable params: 0
_________________________________________________________________
In [101]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(LSTM_model).create(prog='dot', format='svg'))
Out[101]:
[model graph: lstm_2_input (InputLayer) → lstm_2 (LSTM) → dropout_2 (Dropout) → dense_2 (Dense)]
In [25]:
epochs = 10
checkpointer = ModelCheckpoint(filepath='weights.best.from_scratch1.hdf5', verbose=1, save_best_only=True)
print('Start training...')

LSTM_model.fit(X_train1, y_train1, validation_data=(X_validate1, y_validate1), 
              epochs=epochs, callbacks=[checkpointer], verbose=1)
#LSTM_model.fit(X_train1, y_train1, validation_split=0.05, 
#               epochs=epochs, callbacks=[checkpointer], verbose=1)
Start training...
Train on 145063 samples, validate on 145063 samples
Epoch 1/10
145056/145063 [============================>.] - ETA: 0s - loss: 0.6299   Epoch 00000: val_loss improved from inf to 0.54336, saving model to weights.best.from_scratch1.hdf5
145063/145063 [==============================] - 77s - loss: 0.6299 - val_loss: 0.5434
Epoch 2/10
144960/145063 [============================>.] - ETA: 0s - loss: 0.5731  - ETA: 80s - loss: 0.5854Epoch 00001: val_loss did not improve
145063/145063 [==============================] - 77s - loss: 0.5731 - val_loss: 0.6049
Epoch 3/10
145056/145063 [============================>.] - ETA: 0s - loss: 0.5600  Epoch 00002: val_loss did not improve
145063/145063 [==============================] - 77s - loss: 0.5600 - val_loss: 0.6102
Epoch 4/10
144992/145063 [============================>.] - ETA: 0s - loss: 0.5518 Epoch 00003: val_loss improved from 0.54336 to 0.50532, saving model to weights.best.from_scratch1.hdf5
145063/145063 [==============================] - 80s - loss: 0.5518 - val_loss: 0.5053
Epoch 5/10
144960/145063 [============================>.] - ETA: 0s - loss: 0.5437  Epoch 00004: val_loss did not improve
145063/145063 [==============================] - 80s - loss: 0.5437 - val_loss: 0.5800
Epoch 6/10
144992/145063 [============================>.] - ETA: 0s - loss: 0.5384  Epoch 00005: val_loss did not improve
145063/145063 [==============================] - 81s - loss: 0.5384 - val_loss: 0.5148
Epoch 7/10
145056/145063 [============================>.] - ETA: 0s - loss: 0.5352 Epoch 00006: val_loss did not improve
145063/145063 [==============================] - 81s - loss: 0.5352 - val_loss: 0.5108
Epoch 8/10
144992/145063 [============================>.] - ETA: 0s - loss: 0.5301 Epoch 00007: val_loss did not improve
145063/145063 [==============================] - 81s - loss: 0.5301 - val_loss: 0.5069
Epoch 9/10
145024/145063 [============================>.] - ETA: 0s - loss: 0.5252 Epoch 00008: val_loss improved from 0.50532 to 0.48786, saving model to weights.best.from_scratch1.hdf5
145063/145063 [==============================] - 80s - loss: 0.5252 - val_loss: 0.4879
Epoch 10/10
144992/145063 [============================>.] - ETA: 0s - loss: 0.5238 Epoch 00009: val_loss did not improve
145063/145063 [==============================] - 84s - loss: 0.5238 - val_loss: 0.5765
Out[25]:
<keras.callbacks.History at 0x368091590>

Check model performance using SMAPE
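The smape_test helper is defined earlier in the notebook. For reference, a minimal SMAPE implementation, assuming the usual Kaggle convention that pairs where both the actual and predicted values are zero contribute zero:

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0-200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    diff = np.abs(y_pred - y_true)
    # Where both values are zero, define the term as 0 instead of 0/0.
    ratio = np.where(denom == 0, 0.0, diff / np.where(denom == 0, 1.0, denom))
    return 100.0 * ratio.mean()

print(smape([10, 0, 5], [12, 0, 5]))  # ~6.06
```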

In [26]:
LSTM_model.load_weights('weights.best.from_scratch1.hdf5')

########## Validate model 
y_validate_pred = LSTM_model.predict(X_validate1)
#comment for LSTM
#y_validate_pred = y_validate_pred.reshape(y_validate_pred.shape[0], y_validate_pred.shape[2])
y_validate_pred1 = pd.DataFrame(data = y_validate_pred, columns = list(y_validate.columns))

##inverse to counts of visits
y_val_pred = y_validate_pred1.apply(lambda x: np.exp(x)-1)

for i in list(y_val_pred.columns):
    y_val_pred[i] = y_val_pred[i].apply(lambda x: round(x))

y_val_pred[y_val_pred<0]=0
print('y_val_pred after inverse: ', y_val_pred.head())
print('y_validate true: ', y_validate_true.head())

print('SMAPE of validate set when training on X_train: ', smape_test(y_validate_true.stack(), y_val_pred.stack()))

##LSTM 46.39/ 45.5
('y_val_pred after inverse: ',    2016-11-02  2016-11-03  2016-11-04  2016-11-05  2016-11-06  2016-11-07  \
0        22.0        23.0        22.0        24.0        25.0        24.0   
1        23.0        24.0        25.0        28.0        29.0        28.0   
2         3.0         3.0         3.0         3.0         3.0         3.0   
3        18.0        19.0        19.0        20.0        21.0        20.0   
4        16.0        15.0        15.0        16.0        16.0        14.0   

   2016-11-08  2016-11-09  2016-11-10  2016-11-11     ...      2016-12-22  \
0        22.0        21.0        22.0        21.0     ...            21.0   
1        26.0        22.0        23.0        24.0     ...            21.0   
2         3.0         4.0         3.0         3.0     ...             3.0   
3        19.0        18.0        18.0        17.0     ...            17.0   
4        14.0        15.0        14.0        13.0     ...            13.0   

   2016-12-23  2016-12-24  2016-12-25  2016-12-26  2016-12-27  2016-12-28  \
0        22.0        21.0        20.0        20.0        20.0        23.0   
1        24.0        23.0        22.0        22.0        22.0        23.0   
2         3.0         3.0         3.0         3.0         3.0         4.0   
3        19.0        18.0        17.0        17.0        18.0        20.0   
4        13.0        12.0        12.0        12.0        13.0        16.0   

   2016-12-29  2016-12-30  2016-12-31  
0        25.0        22.0        23.0  
1        25.0        23.0        24.0  
2         4.0         4.0         4.0  
3        21.0        19.0        20.0  
4        17.0        14.0        14.0  

[5 rows x 60 columns])
('y_validate true: ',    2016-11-02  2016-11-03  2016-11-04  2016-11-05  2016-11-06  2016-11-07  \
0        18.0        25.0        14.0        20.0         8.0        67.0   
1        11.0        14.0        26.0        11.0        21.0        14.0   
2         3.0         3.0         3.0         2.0        10.0         2.0   
3        12.0        11.0        15.0         7.0        12.0        13.0   
4         5.0         6.0        33.0        13.0        10.0        22.0   

   2016-11-08  2016-11-09  2016-11-10  2016-11-11     ...      2016-12-22  \
0        13.0        41.0        10.0        21.0     ...            32.0   
1        14.0        54.0         5.0        10.0     ...            17.0   
2         2.0         2.0         7.0         3.0     ...             3.0   
3         9.0         8.0        21.0        16.0     ...            32.0   
4        11.0         8.0         4.0        10.0     ...            48.0   

   2016-12-23  2016-12-24  2016-12-25  2016-12-26  2016-12-27  2016-12-28  \
0        63.0        15.0        26.0        14.0        20.0        22.0   
1        42.0        28.0        15.0         9.0        30.0        52.0   
2         1.0         1.0         7.0         4.0         4.0         6.0   
3        10.0        26.0        27.0        16.0        11.0        17.0   
4         9.0        25.0        13.0         3.0        11.0        27.0   

   2016-12-29  2016-12-30  2016-12-31  
0        19.0        18.0        20.0  
1        45.0        26.0        20.0  
2         3.0         4.0        17.0  
3        19.0        10.0        11.0  
4        13.0        36.0        10.0  

[5 rows x 60 columns])
('SMAPE of validate set when training on X_train: ', 45.944525720974546)
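The np.exp(x) - 1 step above inverts a log(1 + x) transform that was presumably applied to the raw view counts during preprocessing. NumPy's log1p/expm1 pair computes the same transform and its inverse with better numerical behavior near zero:

```python
import numpy as np

views = np.array([0.0, 5.0, 18.0, 1234.0])
transformed = np.log1p(views)      # log(1 + x); exactly 0.0 at x = 0
restored = np.expm1(transformed)   # inverse: exp(x) - 1
print(np.allclose(restored, views))  # True
```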

The benchmark model

In [112]:
## benchmark: median of the previous 60 days
y_train_true1 = y_train.copy()
y_train_true1['V'] = y_train_true.median(axis=1)
y_train_true1.head()

for i in list(y_validate_true.columns):
    y_train_true1[i] = y_train_true1['V']

print(y_train_true1[y_train_true1.columns[-60:]].head())
smape_test(y_validate_true.stack(), y_train_true1[y_train_true1.columns[-60:]].stack())
   2016-11-02  2016-11-03  2016-11-04  2016-11-05  2016-11-06  2016-11-07  \
0        17.5        17.5        17.5        17.5        17.5        17.5   
1        26.5        26.5        26.5        26.5        26.5        26.5   
2         4.0         4.0         4.0         4.0         4.0         4.0   
3        14.5        14.5        14.5        14.5        14.5        14.5   
4         6.0         6.0         6.0         6.0         6.0         6.0   

   2016-11-08  2016-11-09  2016-11-10  2016-11-11     ...      2016-12-22  \
0        17.5        17.5        17.5        17.5     ...            17.5   
1        26.5        26.5        26.5        26.5     ...            26.5   
2         4.0         4.0         4.0         4.0     ...             4.0   
3        14.5        14.5        14.5        14.5     ...            14.5   
4         6.0         6.0         6.0         6.0     ...             6.0   

   2016-12-23  2016-12-24  2016-12-25  2016-12-26  2016-12-27  2016-12-28  \
0        17.5        17.5        17.5        17.5        17.5        17.5   
1        26.5        26.5        26.5        26.5        26.5        26.5   
2         4.0         4.0         4.0         4.0         4.0         4.0   
3        14.5        14.5        14.5        14.5        14.5        14.5   
4         6.0         6.0         6.0         6.0         6.0         6.0   

   2016-12-29  2016-12-30  2016-12-31  
0        17.5        17.5        17.5  
1        26.5        26.5        26.5  
2         4.0         4.0         4.0  
3        14.5        14.5        14.5  
4         6.0         6.0         6.0  

[5 rows x 60 columns]
Out[112]:
47.96375802400474
In [135]:
days1 = range(550)
days2 = range(490,550)
def plot_pred(idx):
    fig = plt.figure(1,figsize=(10,5))
    plt.plot(days1,data.iloc[idx,1:], label = 'true y_validate')
    plt.plot(days2, y_train_true1.iloc[idx,-60:], label = 'benchmark')
    plt.xlabel('day')
    plt.ylabel('views')
    plt.title(data0.iloc[idx,0])
    plt.legend()
    plt.show()
In [136]:
for idx in [0, 5, 50, 10000, 50000]:
    plot_pred(idx)
In [132]:
plot_pred(10021)

Predict on X_test, and save output by Page and date

In [143]:
## to get 'Page'
X_test_P = data[['Page']+list(data.columns[-7:])]
print(X_test_P.head())
                                                Page  2016-12-25  2016-12-26  \
0            2NE1_zh.wikipedia.org_all-access_spider        26.0        14.0   
1             2PM_zh.wikipedia.org_all-access_spider        15.0         9.0   
2              3C_zh.wikipedia.org_all-access_spider         7.0         4.0   
3         4minute_zh.wikipedia.org_all-access_spider        27.0        16.0   
4  52_Hz_I_Love_You_zh.wikipedia.org_all-access_s...        13.0         3.0   

   2016-12-27  2016-12-28  2016-12-29  2016-12-30  2016-12-31  
0        20.0        22.0        19.0        18.0        20.0  
1        30.0        52.0        45.0        26.0        20.0  
2         4.0         6.0         3.0         4.0        17.0  
3        11.0        17.0        19.0        10.0        11.0  
4        11.0        27.0        13.0        36.0        10.0  
In [144]:
######## Predict
LSTM_model.load_weights('weights.best.from_scratch1.hdf5')

y_test_pred = LSTM_model.predict(X_test1)

out_cols = list(pd.unique(key_split['date'].values))
y_test_pred1 = pd.DataFrame(data = y_test_pred, columns = list(out_cols))

##Inverse if outlier removed
y_test_pred2 = y_test_pred1.apply(lambda x: np.exp(x)-1)

y_test_pred2[y_test_pred2<0] =0

###merge into Page
test_out = pd.merge(X_test_P, y_test_pred2, left_index = True, right_index = True)
test_out1 = test_out[['Page']+list(test_out.columns.values[-60:])]
print(test_out1.head())
test_out_120D = pd.melt(test_out1, id_vars='Page', var_name='date', value_name='Visits_120D')
test_out_120D.head()
#results_120D = key_split.merge(test_out2, how ='left')
                                                Page  2017-01-01  2017-01-02  \
0            2NE1_zh.wikipedia.org_all-access_spider   20.370663   21.748503   
1             2PM_zh.wikipedia.org_all-access_spider   23.933784   25.650034   
2              3C_zh.wikipedia.org_all-access_spider    5.920140    5.330496   
3         4minute_zh.wikipedia.org_all-access_spider   15.802696   16.738407   
4  52_Hz_I_Love_You_zh.wikipedia.org_all-access_s...   15.828934   15.892532   

   2017-01-03  2017-01-04  2017-01-05  2017-01-06  2017-01-07  2017-01-08  \
0   21.828730   23.770977   24.720915   24.130051   22.504597   20.093098   
1   27.165997   30.038921   30.678347   29.895533   27.011213   22.680622   
2    5.204183    5.739473    5.957233    5.661366    5.464618    6.192055   
3   16.776253   18.360306   19.018902   18.399969   17.278028   15.762699   
4   15.315918   16.487110   16.294899   14.938679   15.033682   16.393711   

   2017-01-09     ...      2017-02-20  2017-02-21  2017-02-22  2017-02-23  \
0   21.129581     ...       21.531868   23.720245   23.002878   21.706676   
1   24.375162     ...       21.829535   25.490503   24.915285   23.551262   
2    5.121596     ...        4.886774    5.299766    5.130033    4.994745   
3   16.211611     ...       16.333988   17.872520   17.349842   16.538219   
4   15.712793     ...       16.211548   16.024599   15.327597   15.097130   

   2017-02-24  2017-02-25  2017-02-26  2017-02-27  2017-02-28  2017-03-01  
0   22.168640   21.867777   23.469065   25.399599   23.789339   24.757181  
1   23.791195   22.805334   23.364523   25.384521   24.687803   25.847651  
2    4.874616    5.395060    6.853021    6.353125    5.611166    5.637169  
3   16.719517   16.765312   18.046919   19.609642   18.094757   18.686569  
4   15.121323   16.419050   20.017298   21.076683   17.586842   17.386002  

[5 rows x 61 columns]
Out[144]:
Page date Visits_120D
0 2NE1_zh.wikipedia.org_all-access_spider 2017-01-01 20.370663
1 2PM_zh.wikipedia.org_all-access_spider 2017-01-01 23.933784
2 3C_zh.wikipedia.org_all-access_spider 2017-01-01 5.920140
3 4minute_zh.wikipedia.org_all-access_spider 2017-01-01 15.802696
4 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... 2017-01-01 15.828934
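pd.melt, used above to turn the wide per-date prediction frame into a long (Page, date, Visits) table, works like this on a toy frame:

```python
import pandas as pd

wide = pd.DataFrame({'Page': ['a', 'b'],
                     '2017-01-01': [1.0, 2.0],
                     '2017-01-02': [3.0, 4.0]})
# One row per (Page, date) pair; column names become the 'date' values.
long_form = pd.melt(wide, id_vars='Page', var_name='date', value_name='Visits')
print(long_form)  # 4 rows
```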

Retrain the model on X_validate, y_validate

In [ ]:
### Retrain on X_validate
In [16]:
checkpointer = ModelCheckpoint(filepath='weights.best.from_scratch2.hdf5', verbose=1, save_best_only=True)
print('Start training...')
LSTM_model.fit(X_validate1, y_validate1, validation_split = 0.05, 
               epochs=epochs, callbacks=[checkpointer], verbose=1)
Start training...
Train on 137809 samples, validate on 7254 samples
Epoch 1/10
137760/137809 [============================>.] - ETA: 0s - loss: 0.5212 Epoch 00000: val_loss improved from inf to 0.50509, saving model to weights.best.from_scratch2.hdf5
137809/137809 [==============================] - 68s - loss: 0.5213 - val_loss: 0.5051
Epoch 2/10
137792/137809 [============================>.] - ETA: 0s - loss: 0.5170  Epoch 00001: val_loss did not improve
137809/137809 [==============================] - 73s - loss: 0.5170 - val_loss: 0.5280
Epoch 3/10
137792/137809 [============================>.] - ETA: 0s - loss: 0.5135  - ETA: 29s - loss: 0.5139Epoch 00002: val_loss did not improve
137809/137809 [==============================] - 71s - loss: 0.5135 - val_loss: 0.5415
Epoch 4/10
137728/137809 [============================>.] - ETA: 0s - loss: 0.5106  Epoch 00003: val_loss improved from 0.50509 to 0.48590, saving model to weights.best.from_scratch2.hdf5
137809/137809 [==============================] - 68s - loss: 0.5106 - val_loss: 0.4859
Epoch 5/10
137696/137809 [============================>.] - ETA: 0s - loss: 0.5070 Epoch 00004: val_loss did not improve
137809/137809 [==============================] - 69s - loss: 0.5070 - val_loss: 0.5108
Epoch 6/10
137728/137809 [============================>.] - ETA: 0s - loss: 0.5047 Epoch 00005: val_loss improved from 0.48590 to 0.48088, saving model to weights.best.from_scratch2.hdf5
137809/137809 [==============================] - 72s - loss: 0.5047 - val_loss: 0.4809
Epoch 7/10
137792/137809 [============================>.] - ETA: 0s - loss: 0.5034  Epoch 00006: val_loss did not improve
137809/137809 [==============================] - 68s - loss: 0.5034 - val_loss: 0.5041
Epoch 8/10
137696/137809 [============================>.] - ETA: 0s - loss: 0.5006 Epoch 00007: val_loss did not improve
137809/137809 [==============================] - 67s - loss: 0.5006 - val_loss: 0.4897
Epoch 9/10
137696/137809 [============================>.] - ETA: 0s - loss: 0.4983 Epoch 00008: val_loss did not improve
137809/137809 [==============================] - 68s - loss: 0.4983 - val_loss: 0.5156
Epoch 10/10
137760/137809 [============================>.] - ETA: 0s - loss: 0.4965 Epoch 00009: val_loss did not improve
137809/137809 [==============================] - 66s - loss: 0.4965 - val_loss: 0.5177
Out[16]:
<keras.callbacks.History at 0x230cd7a50>
In [17]:
LSTM_model.load_weights('weights.best.from_scratch2.hdf5')

### check model on training data
y_train_pred = LSTM_model.predict(X_train1)
y_train_pred1 = pd.DataFrame(data = y_train_pred, columns = list(y_train.columns))

y_train_pred = y_train_pred1.apply(lambda x: np.exp(x)-1)


for i in list(y_train_pred.columns):
    y_train_pred[i] = y_train_pred[i].apply(lambda x: round(x))

y_train_pred[y_train_pred<0]=0

print('SMAPE of train set when training on X_validate: ', smape_test(y_train_true.stack(),y_train_pred.stack()))
#45.52858730028269
('SMAPE of train set when training on X_validate: ', 45.38421268962829)
In [145]:
######## Predict
LSTM_model.load_weights('weights.best.from_scratch2.hdf5')

y_test_pred = LSTM_model.predict(X_test1)

out_cols = list(pd.unique(key_split['date'].values))
y_test_pred1 = pd.DataFrame(data = y_test_pred, columns = list(out_cols))

##Inverse if outlier removed
y_test_pred2 = y_test_pred1.apply(lambda x: np.exp(x)-1)

y_test_pred2[y_test_pred2<0] =0

###merge into Page
test_out = pd.merge(X_test_P, y_test_pred2, left_index = True, right_index = True)
test_out1 = test_out[['Page']+list(test_out.columns.values[-60:])]
print(test_out1.head())
test_out_60D = pd.melt(test_out1, id_vars='Page', var_name='date', value_name='Visits_60D')
test_out_60D.head()
#results_60D = key_split.merge(test_out2, how ='left')
                                                Page  2017-01-01  2017-01-02  \
0            2NE1_zh.wikipedia.org_all-access_spider   20.829733   19.966358   
1             2PM_zh.wikipedia.org_all-access_spider   23.671522   22.976276   
2              3C_zh.wikipedia.org_all-access_spider    4.940643    4.222580   
3         4minute_zh.wikipedia.org_all-access_spider   16.406521   15.573256   
4  52_Hz_I_Love_You_zh.wikipedia.org_all-access_s...   14.735157   13.641690   

   2017-01-03  2017-01-04  2017-01-05  2017-01-06  2017-01-07  2017-01-08  \
0   19.347858   19.060459   20.668100   20.033920   17.958632   15.792356   
1   22.140463   21.594936   23.457668   22.975409   20.821579   18.449301   
2    4.176516    5.038072    4.648211    4.695458    4.006148    3.370309   
3   15.179010   14.972215   16.052105   16.108294   14.511418   12.748382   
4   13.686318   14.079979   13.923717   15.007669   13.463790   11.320927   

   2017-01-09     ...      2017-02-20  2017-02-21  2017-02-22  2017-02-23  \
0   17.159040     ...       19.748674   20.737547   20.357283   20.683899   
1   19.917358     ...       20.848572   21.287354   20.642632   20.962521   
2    3.613002     ...        4.958791    5.169774    6.168847    5.240776   
3   13.840657     ...       16.657152   17.161905   16.693483   16.600130   
4   12.125890     ...       14.628213   14.860127   14.559477   13.886269   

   2017-02-24  2017-02-25  2017-02-26  2017-02-27  2017-02-28  2017-03-01  
0   21.104708   20.044289   21.262270   21.258503   21.006023   20.127913  
1   22.056452   21.187557   22.567749   22.792959   22.342766   21.237173  
2    5.139160    5.040444    5.203341    5.261562    5.474388    6.334915  
3   17.309484   16.789948   17.842249   17.922194   17.614744   16.904827  
4   14.580499   14.883877   15.413691   15.045288   15.236240   14.811943  

[5 rows x 61 columns]
Out[145]:
Page date Visits_60D
0 2NE1_zh.wikipedia.org_all-access_spider 2017-01-01 20.829733
1 2PM_zh.wikipedia.org_all-access_spider 2017-01-01 23.671522
2 3C_zh.wikipedia.org_all-access_spider 2017-01-01 4.940643
3 4minute_zh.wikipedia.org_all-access_spider 2017-01-01 16.406521
4 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... 2017-01-01 14.735157

Combine results from two models

In [146]:
#### Combine two model results
results = test_out_120D.merge(test_out_60D)
results['Visits_Seq'] = results[['Visits_120D', 'Visits_60D']].mean(axis =1).apply(lambda x: round(x))
results.tail()
Out[146]:
Page date Visits_120D Visits_60D Visits_Seq
8703775 Underworld_(serie_de_películas)_es.wikipedia.o... 2017-03-01 3.716784 5.905874 5.0
8703776 Resident_Evil:_Capítulo_Final_es.wikipedia.org... 2017-03-01 0.092288 0.360114 0.0
8703777 Enamorándome_de_Ramón_es.wikipedia.org_all-acc... 2017-03-01 0.092288 0.360114 0.0
8703778 Hasta_el_último_hombre_es.wikipedia.org_all-ac... 2017-03-01 0.092288 0.360114 0.0
8703779 Francisco_el_matemático_(serie_de_televisión_d... 2017-03-01 0.092288 0.360114 0.0

Get the Id from the key file

In [147]:
## Merge on 'Page' and 'date'
key_rnn_result = key_split.merge(results, how ='left')
key_rnn_result.head()
Out[147]:
Page Id date Visits_120D Visits_60D Visits_Seq
0 !vote_en.wikipedia.org_all-access_all-agents bf4edcf969af 2017-01-01 2.757196 2.864099 3.0
1 !vote_en.wikipedia.org_all-access_all-agents 929ed2bf52b9 2017-01-02 2.375104 2.431506 2.0
2 !vote_en.wikipedia.org_all-access_all-agents ff29d0f51d5c 2017-01-03 2.382111 2.420799 2.0
3 !vote_en.wikipedia.org_all-access_all-agents e98873359be6 2017-01-04 2.700510 2.568208 3.0
4 !vote_en.wikipedia.org_all-access_all-agents fa012434263a 2017-01-05 2.815577 2.485751 3.0
In [148]:
rnn_results = key_rnn_result[['Id','Visits_Seq']]
rnn_results.tail()
Out[148]:
Id Visits_Seq
8703775 f69747f5ee68 214.0
8703776 2489963dc503 237.0
8703777 b0624c909f4c 245.0
8703778 24a1dfb06c10 228.0
8703779 add681d54216 225.0

Median of Medians

In [108]:
##### last week, daily
train_bench = data[['Page'] + list(data.columns[-7:])].copy()  # .copy() avoids SettingWithCopyWarning on the next assignment
train_bench['last_Visits'] = train_bench[list(train_bench.columns[-7:])].median(axis=1)
train_bench_last_week = train_bench[['Page', 'last_Visits']]

results1 = key_split.merge(train_bench_last_week, how = 'left')

results1['date'] = results1['date'].astype('datetime64[ns]')
results1['weekend'] = (results1.date.dt.dayofweek).astype(float)  # note: despite the name, this is day-of-week (0 = Monday)
results1.tail()

##### Previous 3 weeks, DayofWeek
train_bench_3week = data[['Page'] + list(data.columns[-21:])]
train_3week_flattened = pd.melt(train_bench_3week, id_vars='Page', var_name='date', value_name='Visits_3week')
train_3week_flattened['date'] = train_3week_flattened['date'].astype('datetime64[ns]')
train_3week_flattened['weekend'] = (train_3week_flattened.date.dt.dayofweek).astype(float)
train_3week_m = train_3week_flattened.groupby(['Page','weekend']).median().reset_index()
train_3week_m['L3week_Visits'] = train_3week_m.Visits_3week
print(train_3week_m.head())
train_bench_3week = train_3week_m[['Page',  'weekend', 'L3week_Visits']]
train_bench_3week.tail()



results2 = results1.merge(train_bench_3week, how = 'left')
results2.tail()

### previous 9 weeks which are more than 2 months
train_bench_9week = data[['Page'] + list(data.columns[-63:])]
train_9week_flattened = pd.melt(train_bench_9week, id_vars='Page', var_name='date', value_name='Visits_9week')
train_9week_flattened['date'] = train_9week_flattened['date'].astype('datetime64[ns]')
train_9week_flattened['weekend'] = (train_9week_flattened.date.dt.dayofweek).astype(float)
train_9week_m = train_9week_flattened.groupby(['Page','weekend']).median().reset_index()
train_9week_m['L9week_Visits'] = train_9week_m.Visits_9week
print(train_9week_m.head())
train_bench_9week = train_9week_m[['Page',  'weekend', 'L9week_Visits']]
train_bench_9week.tail()

results4 = results2.merge(train_bench_9week, how = 'left')
results4.tail()


#### last year, 2016
train_bench_year = data[['Page'] + list(data.columns[-365:])]
train_year_flattened = pd.melt(train_bench_year, id_vars='Page', var_name='date', value_name='Visits_year')
train_year_flattened['date'] = train_year_flattened['date'].astype('datetime64[ns]')
train_year_flattened['weekend'] = (train_year_flattened.date.dt.dayofweek).astype(float)
train_year_m = train_year_flattened.groupby(['Page','weekend']).median().reset_index()
train_year_m['year_Visits'] = train_year_m.Visits_year
print(train_year_m.head())
train_bench_year = train_year_m[['Page',  'weekend', 'year_Visits']]
#print(train_bench_year.tail())


results6 = results4.merge(train_bench_year, how = 'left')
results6.tail()


### same month in previous years
month_cols = [d.strftime('%Y-%m-%d') for d in pd.date_range('2016-01-01', '2016-03-01')]
train_bench_month = data[['Page'] + month_cols]
train_month_flattened = pd.melt(train_bench_month, id_vars='Page', var_name='date', value_name='Visits_month')
train_month_flattened['date'] = train_month_flattened['date'].astype('datetime64[ns]')
train_month_flattened['weekend'] = (train_month_flattened.date.dt.dayofweek).astype(float)
train_month_m = train_month_flattened.groupby(['Page','weekend']).median().reset_index()
train_month_m['Month_Visits'] = train_month_m.Visits_month
print(train_month_m.head())
train_bench_month = train_month_m[['Page',  'weekend', 'Month_Visits']]
train_bench_month.head()


results7 = results6.merge(train_bench_month, how = 'left')
results7.tail()
                                           Page  weekend  Visits_3week  \
0  !vote_en.wikipedia.org_all-access_all-agents      0.0           1.0   
1  !vote_en.wikipedia.org_all-access_all-agents      1.0           1.0   
2  !vote_en.wikipedia.org_all-access_all-agents      2.0           2.0   
3  !vote_en.wikipedia.org_all-access_all-agents      3.0           3.0   
4  !vote_en.wikipedia.org_all-access_all-agents      4.0           1.0   

   L3week_Visits  
0            1.0  
1            1.0  
2            2.0  
3            3.0  
4            1.0  
                                           Page  weekend  Visits_9week  \
0  !vote_en.wikipedia.org_all-access_all-agents      0.0           3.0   
1  !vote_en.wikipedia.org_all-access_all-agents      1.0           2.0   
2  !vote_en.wikipedia.org_all-access_all-agents      2.0           3.0   
3  !vote_en.wikipedia.org_all-access_all-agents      3.0           3.0   
4  !vote_en.wikipedia.org_all-access_all-agents      4.0           2.0   

   L9week_Visits  
0            3.0  
1            2.0  
2            3.0  
3            3.0  
4            2.0  
                                           Page  weekend  Visits_year  \
0  !vote_en.wikipedia.org_all-access_all-agents      0.0          3.0   
1  !vote_en.wikipedia.org_all-access_all-agents      1.0          3.0   
2  !vote_en.wikipedia.org_all-access_all-agents      2.0          3.0   
3  !vote_en.wikipedia.org_all-access_all-agents      3.0          3.0   
4  !vote_en.wikipedia.org_all-access_all-agents      4.0          3.0   

   year_Visits  
0          3.0  
1          3.0  
2          3.0  
3          3.0  
4          3.0  
                                           Page  weekend  Visits_month  \
0  !vote_en.wikipedia.org_all-access_all-agents      0.0           2.0   
1  !vote_en.wikipedia.org_all-access_all-agents      1.0           3.0   
2  !vote_en.wikipedia.org_all-access_all-agents      2.0           2.0   
3  !vote_en.wikipedia.org_all-access_all-agents      3.0           3.5   
4  !vote_en.wikipedia.org_all-access_all-agents      4.0           2.0   

   Month_Visits  
0           2.0  
1           3.0  
2           2.0  
3           3.5  
4           2.0  
Out[108]:
Page Id date last_Visits weekend L3week_Visits L9week_Visits year_Visits Month_Visits
8703775 龙生九子_zh.wikipedia.org_mobile-web_all-agents f69747f5ee68 2017-02-25 339.0 5.0 309.0 233.0 150.0 142.0
8703776 龙生九子_zh.wikipedia.org_mobile-web_all-agents 2489963dc503 2017-02-26 339.0 6.0 339.0 246.0 157.5 144.0
8703777 龙生九子_zh.wikipedia.org_mobile-web_all-agents b0624c909f4c 2017-02-27 339.0 0.0 302.0 217.0 132.0 136.0
8703778 龙生九子_zh.wikipedia.org_mobile-web_all-agents 24a1dfb06c10 2017-02-28 339.0 1.0 302.0 198.0 122.5 123.0
8703779 龙生九子_zh.wikipedia.org_mobile-web_all-agents add681d54216 2017-03-01 339.0 2.0 283.0 214.0 123.5 121.0
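The four look-back blocks above (3 weeks, 9 weeks, last year, same month last year) repeat the same melt → day-of-week → per-page median pattern. A small helper (hypothetical name dow_median) would remove the duplication; a sketch under the notebook's assumptions (a 'Page' column followed by date-named columns):

```python
import pandas as pd

def dow_median(df, n_days, col_name):
    """Per-page median of the last n_days columns, grouped by day of week."""
    recent = df[['Page'] + list(df.columns[-n_days:])]
    flat = pd.melt(recent, id_vars='Page', var_name='date', value_name=col_name)
    flat['date'] = pd.to_datetime(flat['date'])
    flat['weekend'] = flat['date'].dt.dayofweek.astype(float)  # 0 = Monday
    return flat.groupby(['Page', 'weekend'], as_index=False)[col_name].median()

# Usage sketch, mirroring the cells above:
# train_bench_3week = dow_median(data, 21, 'L3week_Visits')
# train_bench_9week = dow_median(data, 63, 'L9week_Visits')
```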
In [195]:
#### run if you want to check EDA results
results7['Visits_M'] = results7[['last_Visits', 'L3week_Visits', 'L9week_Visits', 'year_Visits', 'Month_Visits']].median(axis =1).apply(lambda x: round(x))

Combine RNN and EDA results

In [37]:
print(rnn_results.columns)
print(results7.columns)
Index([u'Id', u'Visits_Seq'], dtype='object')
Index([u'Page', u'Id', u'date', u'last_Visits', u'weekend', u'L3week_Visits',
       u'L9week_Visits', u'year_Visits', u'Month_Visits', u'Visits_M'],
      dtype='object')
In [149]:
final_results = rnn_results.merge(results7, how = 'left')
final_results['Visits'] = final_results[['Visits_Seq', 'last_Visits', 'L3week_Visits', 'L9week_Visits', 'year_Visits', 'Month_Visits']].median(axis =1).apply(lambda x: round(x))
#final_results[['Id','Visits']].to_csv('results_RNN_EDA.csv', index=False)
In [150]:
final_results.head()
Out[150]:
Id Visits_Seq Page date last_Visits weekend L3week_Visits L9week_Visits year_Visits Month_Visits Visits
0 bf4edcf969af 3.0 !vote_en.wikipedia.org_all-access_all-agents 2017-01-01 1.0 6.0 2.0 3.0 2.0 2.0 2.0
1 929ed2bf52b9 2.0 !vote_en.wikipedia.org_all-access_all-agents 2017-01-02 1.0 0.0 1.0 3.0 3.0 2.0 2.0
2 ff29d0f51d5c 2.0 !vote_en.wikipedia.org_all-access_all-agents 2017-01-03 1.0 1.0 1.0 2.0 3.0 3.0 2.0
3 e98873359be6 3.0 !vote_en.wikipedia.org_all-access_all-agents 2017-01-04 1.0 2.0 2.0 3.0 3.0 2.0 3.0
4 fa012434263a 3.0 !vote_en.wikipedia.org_all-access_all-agents 2017-01-05 1.0 3.0 3.0 3.0 3.0 3.5 3.0

Check Performance

In [102]:
anwer_key = pd.read_csv("answer_key_1_0.csv")
anwer_key.head()
Out[102]:
Id Visits
0 bf4edcf969af 7.0
1 929ed2bf52b9 2.0
2 ff29d0f51d5c 4.0
3 e98873359be6 2.0
4 fa012434263a 4.0
In [152]:
### benchmark: median of views over the last 60 days
bench = data[['Page'] + list(data.columns[-60:])]
bench['last_Visits'] = bench[list(bench.columns[-60:])].median(axis=1)
benchmark = bench[['Page', 'last_Visits']]

results0 = key_split.merge(benchmark, how = 'left')
smape_test(anwer_key['Visits'], results0['last_Visits'])
/Users/cz692t/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
Out[152]:
46.510111921825725
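Slicing columns out of a DataFrame and then assigning into the slice can trigger pandas' `SettingWithCopyWarning`; taking an explicit `.copy()` sidesteps it. A toy sketch of the benchmark construction, with hypothetical day columns:

```python
import pandas as pd
import numpy as np

# Hypothetical wide frame: one row per page, one column per day
df = pd.DataFrame(np.arange(12).reshape(2, 6),
                  columns=['d' + str(i) for i in range(6)])
df.insert(0, 'Page', ['p1', 'p2'])

# .copy() makes bench an independent frame, so the assignment is safe
bench = df[['Page'] + list(df.columns[-3:])].copy()
bench['last_Visits'] = bench[bench.columns[-3:]].median(axis=1)
```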
In [109]:
#L9week_Visits
smape_test(anwer_key['Visits'], results7['L9week_Visits'])
Out[109]:
46.70122162330847
In [48]:
smape_test(anwer_key['Visits'], rnn_results['Visits_Seq'])
### 45.97 when trained with a validation split; 45.89 without
Out[48]:
45.896923710764995
In [36]:
smape_test(anwer_key['Visits'], results7['Visits_M'])
Out[36]:
44.289874695264906
In [39]:
smape_test(anwer_key['Visits'], final_results['Visits'])
### the best score of all the approaches tried
Out[39]:
43.590704586850855
In [49]:
key_rnn_result.head()
Out[49]:
Page Id date Visits_120D Visits_60D Visits_Seq
0 !vote_en.wikipedia.org_all-access_all-agents bf4edcf969af 2017-01-01 2.757196 2.864099 3.0
1 !vote_en.wikipedia.org_all-access_all-agents 929ed2bf52b9 2017-01-02 2.375104 2.431506 2.0
2 !vote_en.wikipedia.org_all-access_all-agents ff29d0f51d5c 2017-01-03 2.382111 2.420799 2.0
3 !vote_en.wikipedia.org_all-access_all-agents e98873359be6 2017-01-04 2.700510 2.568208 3.0
4 !vote_en.wikipedia.org_all-access_all-agents fa012434263a 2017-01-05 2.815577 2.485751 3.0
In [51]:
smape_test(anwer_key['Visits'], key_rnn_result['Visits_60D'].apply(lambda x:round(x)))
Out[51]:
46.15077999144785
In [52]:
smape_test(anwer_key['Visits'], key_rnn_result['Visits_120D'].apply(lambda x:round(x)))
Out[52]:
46.86666187821527

Visualize final results

In [151]:
pred_result = final_results[['Page', 'date', 'Visits']]
pred_result.head()
Out[151]:
Page date Visits
0 !vote_en.wikipedia.org_all-access_all-agents 2017-01-01 2.0
1 !vote_en.wikipedia.org_all-access_all-agents 2017-01-02 2.0
2 !vote_en.wikipedia.org_all-access_all-agents 2017-01-03 2.0
3 !vote_en.wikipedia.org_all-access_all-agents 2017-01-04 3.0
4 !vote_en.wikipedia.org_all-access_all-agents 2017-01-05 3.0
In [139]:
### data set including the true y_test
data2 = pd.read_csv('stage_2/train_2.csv')
data2.head(20)
Out[139]:
Page 2015-07-01 2015-07-02 2015-07-03 2015-07-04 2015-07-05 2015-07-06 2015-07-07 2015-07-08 2015-07-09 ... 2017-09-01 2017-09-02 2017-09-03 2017-09-04 2017-09-05 2017-09-06 2017-09-07 2017-09-08 2017-09-09 2017-09-10
0 2NE1_zh.wikipedia.org_all-access_spider 18.0 11.0 5.0 13.0 14.0 9.0 9.0 22.0 26.0 ... 19.0 33.0 33.0 18.0 16.0 27.0 29.0 23.0 54.0 38.0
1 2PM_zh.wikipedia.org_all-access_spider 11.0 14.0 15.0 18.0 11.0 13.0 22.0 11.0 10.0 ... 32.0 30.0 11.0 19.0 54.0 25.0 26.0 23.0 13.0 81.0
2 3C_zh.wikipedia.org_all-access_spider 1.0 0.0 1.0 1.0 0.0 4.0 0.0 3.0 4.0 ... 6.0 6.0 7.0 2.0 4.0 7.0 3.0 4.0 7.0 6.0
3 4minute_zh.wikipedia.org_all-access_spider 35.0 13.0 10.0 94.0 4.0 26.0 14.0 9.0 11.0 ... 7.0 19.0 19.0 9.0 6.0 16.0 19.0 30.0 38.0 4.0
4 52_Hz_I_Love_You_zh.wikipedia.org_all-access_s... NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 16.0 16.0 19.0 9.0 20.0 23.0 28.0 14.0 8.0 7.0
5 5566_zh.wikipedia.org_all-access_spider 12.0 7.0 4.0 5.0 20.0 8.0 5.0 17.0 24.0 ... 13.0 13.0 45.0 4.0 13.0 20.0 18.0 17.0 14.0 11.0
6 91Days_zh.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 12.0 8.0 5.0 7.0 8.0 10.0 8.0 5.0 3.0 5.0
7 A'N'D_zh.wikipedia.org_all-access_spider 118.0 26.0 30.0 24.0 29.0 127.0 53.0 37.0 20.0 ... 74.0 39.0 11.0 55.0 71.0 44.0 25.0 39.0 25.0 50.0
8 AKB48_zh.wikipedia.org_all-access_spider 5.0 23.0 14.0 12.0 9.0 9.0 35.0 15.0 14.0 ... 53.0 107.0 63.0 42.0 24.0 44.0 33.0 52.0 21.0 48.0
9 ASCII_zh.wikipedia.org_all-access_spider 6.0 3.0 5.0 12.0 6.0 5.0 4.0 13.0 9.0 ... 20.0 16.0 22.0 19.0 21.0 32.0 34.0 29.0 23.0 25.0
10 ASTRO_zh.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN ... 17.0 17.0 25.0 4.0 29.0 14.0 44.0 62.0 47.0 16.0
11 Ahq_e-Sports_Club_zh.wikipedia.org_all-access_... 2.0 1.0 4.0 4.0 2.0 6.0 3.0 6.0 9.0 ... 10.0 13.0 8.0 16.0 6.0 12.0 6.0 8.0 6.0 13.0
12 All_your_base_are_belong_to_us_zh.wikipedia.or... 2.0 5.0 5.0 1.0 3.0 3.0 5.0 3.0 17.0 ... 7.0 4.0 5.0 10.0 7.0 4.0 2.0 6.0 2.0 22.0
13 AlphaGo_zh.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 21.0 16.0 25.0 15.0 19.0 25.0 13.0 34.0 26.0 14.0
14 Android_zh.wikipedia.org_all-access_spider 8.0 27.0 9.0 25.0 25.0 10.0 34.0 22.0 17.0 ... 36.0 32.0 51.0 40.0 49.0 48.0 38.0 85.0 41.0 38.0
15 Angelababy_zh.wikipedia.org_all-access_spider 40.0 17.0 25.0 42.0 41.0 7.0 18.0 21.0 33.0 ... 24.0 22.0 19.0 18.0 30.0 66.0 24.0 37.0 21.0 23.0
16 Apink_zh.wikipedia.org_all-access_spider 61.0 33.0 21.0 10.0 26.0 11.0 39.0 195.0 62.0 ... 38.0 27.0 42.0 22.0 21.0 30.0 77.0 32.0 105.0 18.0
17 Apple_II_zh.wikipedia.org_all-access_spider 4.0 8.0 4.0 9.0 7.0 4.0 15.0 9.0 17.0 ... 8.0 5.0 11.0 17.0 9.0 16.0 8.0 14.0 9.0 14.0
18 As_One_zh.wikipedia.org_all-access_spider 13.0 7.0 14.0 11.0 20.0 5.0 32.0 11.0 6.0 ... 16.0 11.0 19.0 8.0 16.0 8.0 22.0 12.0 9.0 4.0
19 B-PROJECT_zh.wikipedia.org_all-access_spider NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 12.0 13.0 9.0 4.0 4.0 15.0 7.0 10.0 4.0 6.0

20 rows × 804 columns

In [141]:
data2 = data2.fillna(0)
In [166]:
benchmark.head()
### replicate the benchmark median across all 60 forecast days
benchmark = benchmark.copy()  # work on a copy to avoid SettingWithCopyWarning
for col in ['Visits' + str(i) for i in range(60)]:
    benchmark[col] = benchmark['last_Visits']
In [189]:
final_results.columns
Out[189]:
Index([u'Id', u'Visits_Seq', u'Page', u'date', u'last_Visits', u'weekend',
       u'L3week_Visits', u'L9week_Visits', u'year_Visits', u'Month_Visits',
       u'Visits'],
      dtype='object')
In [196]:
results7.columns
Out[196]:
Index([u'Page', u'Id', u'date', u'last_Visits', u'weekend', u'L3week_Visits',
       u'L9week_Visits', u'year_Visits', u'Month_Visits', u'Visits_M'],
      dtype='object')
In [205]:
days1 = range(500, 610)
days2 = range(550, 610)
def plot_pred1(idx):
    plt.figure(figsize=(15, 8))
    page = data2.iloc[idx, 0]
    print(page)
    plt.plot(days1, data2.iloc[idx, 501:611], label='true y_test')
    plt.plot(days2, benchmark.iloc[idx, -60:], label='benchmark')
    plt.plot(days2, pred_result[pred_result.Page == page].Visits, label='final model')
    plt.plot(days2, final_results[final_results.Page == page].Visits_Seq, label='LSTM')
    plt.plot(days2, results7[results7.Page == page].Visits_M, label='Median of Medians')
    plt.xlabel('day')
    plt.ylabel('views')
    plt.legend()
    plt.show()
In [207]:
for idx in [0, 5, 50, 5000, 10000, 50000, 1000, 2000, 3000, 4000,5500,6000,6500,7000,7500,8000,8500,9000,9500,10500,11000,11500,12000,12500,13000,13500,14000,14500,15000,15500,16000,16500,17000,17500,18000,18500,19000]:
    plot_pred1(idx)
2NE1_zh.wikipedia.org_all-access_spider
5566_zh.wikipedia.org_all-access_spider
Fate/Zero_zh.wikipedia.org_all-access_spider
Guadeloupe_fr.wikipedia.org_desktop_all-agents
Ohi_Day_en.wikipedia.org_desktop_all-agents
Suicide_Squad_(Film)_de.wikipedia.org_all-access_spider
大魯閣草衙道_zh.wikipedia.org_all-access_spider
鄭恩地_zh.wikipedia.org_all-access_spider
陳嘉寶_zh.wikipedia.org_all-access_spider
輔大心理系性侵事件_zh.wikipedia.org_all-access_spider
Michel_de_Montaigne_fr.wikipedia.org_desktop_all-agents
Taj_Mahal_fr.wikipedia.org_desktop_all-agents
Heroes_fr.wikipedia.org_desktop_all-agents
Sylvie_Tellier_fr.wikipedia.org_desktop_all-agents
Passage_piéton_fr.wikipedia.org_desktop_all-agents
Génération_Y_fr.wikipedia.org_desktop_all-agents
Allison_Williams_(actress)_en.wikipedia.org_desktop_all-agents
Ed_Roberts_(activist)_en.wikipedia.org_desktop_all-agents
Julian_Edelman_en.wikipedia.org_desktop_all-agents
Template:Syrian_Civil_War_detailed_map_en.wikipedia.org_desktop_all-agents
Antoninus_Pius_en.wikipedia.org_desktop_all-agents
Lauren_Graham_en.wikipedia.org_desktop_all-agents
The_Originals_(TV_series)_en.wikipedia.org_desktop_all-agents
List_of_Hallmark_Channel_Original_Movies_en.wikipedia.org_desktop_all-agents
Nat_King_Cole_en.wikipedia.org_desktop_all-agents
Category:Full_nudity_commons.wikimedia.org_all-access_spider
File:Brown_noise.ogg_commons.wikimedia.org_all-access_spider
File:Wiesel_1_TOW.jpg_commons.wikimedia.org_all-access_spider
File:Eric_Stoltz-2009_cropped.jpg_commons.wikimedia.org_all-access_spider
File:Haanja_2010_01_1.jpg_commons.wikimedia.org_all-access_spider
Хэддок,_Лора_ru.wikipedia.org_mobile-web_all-agents
Ад_(Божественная_комедия)_ru.wikipedia.org_mobile-web_all-agents
Список_музыкальных_жанров,_направлений_и_стилей_ru.wikipedia.org_mobile-web_all-agents
Парад_Победы_ru.wikipedia.org_mobile-web_all-agents
Блэк,_Джек_ru.wikipedia.org_mobile-web_all-agents
Николай_II_ru.wikipedia.org_mobile-web_all-agents
Николай_Чудотворец_ru.wikipedia.org_mobile-web_all-agents